Steps for verification of Clustered Media Server
These are the steps I take after installing a clustered media server (which is really 2 different clusters, and multiple other servers).
There is a lot of complexity to the overall system, so it's easier to walk through and confirm each system and its function individually than to install everything and try to go straight to the end result.
Assumption:
For this step-by-step verification procedure, the following installation scenario is assumed. The steps may differ somewhat for a different installation configuration.
Configuration / Servers:
Machine A |
PR1 | Proxy Registrar Primary Node |
WASPR1 | PR Cluster WAS Proxy 1 |
CM1 | Conference Manager Primary Node |
WASCM1 | CM Cluster WAS Proxy 1 |
PS1 | PacketSwitch 1 (installed via Primary Node install) |
Machine B |
PR2 | Proxy Registrar Secondary Node |
WASPR2 | PR Cluster WAS Proxy 2 (see Note) |
CM2 | Conference Manager Secondary Node |
WASCM2 | CM Cluster WAS Proxy 2 |
PS2 | PacketSwitch 2 (installed via Primary Node install) |
Note: If using TLS, then WASPR1 and WASPR2 should never be started at the same time. Only one of these should be running at a time. WASPR2 should be considered a cold standby server.
Steps
Install all of the servers as per installation Infocenter instructions, cluster them, import certs, etc.
The starting point for the next instructions assumes that this has already been done for all server instances. All other install steps should be taken too - treat the steps below as additions to the installation instructions, not as a complete set of instructions.
Do not Federate the PS instances.
Verify prereq configuration
In order for AV to function correctly, all users must have an email address in LDAP, and the Directory Info service in Community server must be functional and working correctly.
The easiest way to verify this is to hover over someone in the buddy list, and do a Business Card lookup for the person. If there is no email address associated with the user, then either the Directory Info Service is not working, or they don't have an email address in LDAP. Either way, there is no point in going further without having these requirements met.
See the Community server docs to fix the Directory Info Service. Do not proceed until these are corrected.
Verify configurations / customize for simplicity:
I like to do these steps to help with the verification process, and they also make debugging the system easier. While not required for an installation to work, I consider them required so that you don't lose your sanity while debugging.
Stop all servers listed above.
During the install of the servers, the installations will just pick the next available, unused ports. That can mean that different servers, of the same type, are actually all using different sets of ports for SIP and SIPS traffic.
Now, with all servers stopped, you can go through each server, and verify that the SIP / SIPS ports are equal for all nodes of the same type. For example SIP / SIPS for the two PR nodes might be 5062 and 5063.
Keep the SIP ports even, and the SIPS ports odd - keep them right next to each other.
Here's one layout - go through, and record / set the values for all servers.
Server | SIP | SIPS | HTTP |
Machine A | | | |
WASPR1 | 5060 | 5061 | 9080 |
PR1 | 5062 | 5063 | 9081 |
WASCM1 | 5064 | 5065 | 9082 |
CM1 | 5066 | 5067 | 9083 |
PS1 | 5068 | 5069 | 9084 |
Machine B | | | |
WASPR2 | 5060 | 5061 | 9080 |
PR2 | 5062 | 5063 | 9081 |
WASCM2 | 5064 | 5065 | 9082 |
CM2 | 5066 | 5067 | 9083 |
PS2 | 5068 | 5069 | 9084 |
The values can be checked on each server's configuration page -> Ports.
For the normal nodes, the values are the settings for:
SIP_DEFAULTHOST
SIP_DEFAULTHOST_SECURE
For the WAS Proxies, the values are:
PROXY_SIP_ADDRESS
PROXY_SIPS_ADDRESS
Now, all of the servers of the same type/role have exactly the same ports.
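If you record the ports as you go, a few lines of Python can check the rules above for you. This is only a sketch: the LAYOUT dictionary is the example layout from the table, and the helper names (role, check_layout) are made up for illustration. It checks that SIP ports are even, that each SIPS port is the SIP port plus one, and that servers of the same role use the same SIP / SIPS pair.

```python
# Port layout from the table above: server -> (SIP, SIPS, HTTP).
LAYOUT = {
    "WASPR1": (5060, 5061, 9080),
    "PR1": (5062, 5063, 9081),
    "WASCM1": (5064, 5065, 9082),
    "CM1": (5066, 5067, 9083),
    "PS1": (5068, 5069, 9084),
    "WASPR2": (5060, 5061, 9080),
    "PR2": (5062, 5063, 9081),
    "WASCM2": (5064, 5065, 9082),
    "CM2": (5066, 5067, 9083),
    "PS2": (5068, 5069, 9084),
}


def role(server):
    """Strip the trailing node number, e.g. 'WASPR1' -> 'WASPR'."""
    return server.rstrip("0123456789")


def check_layout(layout):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    first_seen = {}  # role -> (server, sip, sips) of the first server with that role
    for server, (sip, sips, _http) in sorted(layout.items()):
        if sip % 2 != 0:
            problems.append(f"{server}: SIP port {sip} is not even")
        if sips != sip + 1:
            problems.append(f"{server}: SIPS port {sips} is not SIP + 1")
        ref = first_seen.setdefault(role(server), (server, sip, sips))
        if (sip, sips) != ref[1:]:
            problems.append(f"{server}: SIP/SIPS ports differ from {ref[0]}")
    return problems


if __name__ == "__main__":
    for line in check_layout(LAYOUT) or ["layout is consistent"]:
        print(line)
```

Edit LAYOUT to match what the Ports pages actually show, and fix any server it complains about before moving on.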
Verify all used ports are in the Virtual Hosts list:
There are 2 virtual host values used - sip_proxyreg_host and default_host.
Go to the Virtual Hosts administration page, and add ALL of the ports above to both of these entries. Technically you only need the PR ports in sip_proxyreg_host and the CM and PS ports in default_host, but I've found it simpler to add all ports to both. It will help remove frustration later.
Create a single unified stavconfig.xml file, and push it to all servers.
During the install, each server gets its own stavconfig.xml file. While the files are almost the same for all servers, the installer leaves out the information that a given server doesn't need. For example, the PR doesn't need to know the community server port, so it isn't included in the PR's copy. The PR also doesn't need the CM info, so that is left out of its copy as well. But you can create a single file with all required info in it, so it can be used on every single server.
The easiest way is to use the PS's file as a starting point. Copy it back to your workstation, then verify these values:
<configuration lastUpdated="1226425838277" name="STCommunityServerHost" value="<community.server.hostname>"/>
<configuration lastUpdated="1226425838277" name="STCommunityServerPort" value="1516"/>
<configuration lastUpdated="1226425838277" name="ConferenceServerHost" value="<machineA IP, or LB IP>"/>
<configuration lastUpdated="1226425838277" name="ConferenceServerPort" value="5065 <WASCM1 SIPS>"/>
<configuration lastUpdated="1226425838277" name="SIPProxyServerHost" value="<machine A IP, or LB IP>"/>
<configuration lastUpdated="1226425838277" name="SIPProxyServerPort" value="5060 <WASPR1 SIPS>"/>
The values used here should be the entry point for the cluster, so either the LB or one of the WAS Proxies.
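To see quickly which of these entries a given copy of the file contains, something like the following can help. It is a sketch under assumptions: the root element name in SAMPLE (stavconfig) is made up, since only the configuration elements are shown above, and the helper names are hypothetical. The sample mimics a PR copy that lacks the community server port and the CM entries.

```python
import xml.etree.ElementTree as ET

# The six entry-point settings listed above.
REQUIRED = [
    "STCommunityServerHost",
    "STCommunityServerPort",
    "ConferenceServerHost",
    "ConferenceServerPort",
    "SIPProxyServerHost",
    "SIPProxyServerPort",
]


def read_settings(xml_text):
    """Map each <configuration name="..."> element to its value attribute."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): el.get("value") for el in root.iter("configuration")}


def missing_settings(settings):
    """Return the required names that are absent or empty."""
    return [name for name in REQUIRED if not settings.get(name)]


# Hypothetical sample; the root element name is an assumption.
SAMPLE = """<stavconfig>
  <configuration lastUpdated="1226425838277" name="STCommunityServerHost" value="community.example.com"/>
  <configuration lastUpdated="1226425838277" name="SIPProxyServerHost" value="machineA"/>
  <configuration lastUpdated="1226425838277" name="SIPProxyServerPort" value="5060"/>
</stavconfig>"""

if __name__ == "__main__":
    print("missing:", missing_settings(read_settings(SAMPLE)))
```

Run it against each server's copy before merging; anything reported as missing needs to be filled in for the unified file.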
The Packet Switch sections should be copied into a single document, so that it looks like this:
<packetswitches>
<packetswitch host="machineA" port="5069" id="PacketSwitch1" serverName="server1" transportProtocol="UDP"
portIsMultiple="1" singleAudioPort="39000" singleVideoPort="40000"
startingAudioPortRange="42000" endingAudioPortRange="43000"
startingVideoPortRange="46000" endingVideoPortRange="47000" capacity="200000" udpInboundPort="55555"/>
<packetswitch host="machineB" port="5069" id="PacketSwitch2" serverName="server2" transportProtocol="UDP"
portIsMultiple="1" singleAudioPort="39000" singleVideoPort="40000"
startingAudioPortRange="42000" endingAudioPortRange="43000"
startingVideoPortRange="46000" endingVideoPortRange="47000" capacity="200000" udpInboundPort="55555"/>
</packetswitches>
Once the single, unified stavconfig.xml file has been created, replace all of the copies with this single version.
On the DM, you can search through the DM's copies of the clustered nodes, replace the copies there, and then resync the files out to the nodes. For the PS versions of the files, you will have to replace them directly on the nodes. Verify that the unified copy is in place on all machines, and that all values are correct.
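One way to confirm that every copy really is the unified version is to compare checksums. The sketch below assumes you have collected the copies somewhere readable (for example, copied back to your workstation); the function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def digest(path):
    """SHA-256 of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def differing_copies(reference, copies):
    """Return the copies whose content differs from the reference file."""
    expected = digest(reference)
    return [str(p) for p in copies if digest(p) != expected]


if __name__ == "__main__":
    import sys

    # Usage: python compare.py unified/stavconfig.xml copy1.xml copy2.xml ...
    if len(sys.argv) > 2:
        for path in differing_copies(sys.argv[1], sys.argv[2:]):
            print("differs:", path)
```

A byte-for-byte comparison is deliberately strict: even a whitespace change shows up, which is what you want when the goal is a single identical file everywhere.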
System Verification
Single nodes (#1 Nodes)
At this point, you are ready to start the servers and verify the function of each one. I like to do this in steps, so I know each server is working correctly before moving to the next server. This prevents starting up everything, seeing that it doesn't work, and then having no idea where to start debugging.
Start with the PR servers. After they are started, go to the PR status page to make sure the PR is configured and running correctly:
http://machineA:9081/Registrar/RegistrationsTable.jsp
You should see a table with a list of registrations. The only registered SIP entry should be the PR itself.
Next, start the CM. After these two servers are started, the CM should register with the PR. Reload the PR status page above, and make sure the CM has registered correctly.
You can now also view the status page of the CM: http://machineA:9083/ConferenceFocus/ConferenceFocusStatus
If the CM is not registered with the PR, don't go further until you figure out why.
Once the PR and the CM are started and OK, start the PS. Reload the PR status page, and you should see the PS listed in the registrations.
Reload the CM status page, and you should see the PS listed as a valid, usable packet switch.
You can load the PS status page to check its status: http://machineA:9084/STAVPacketSwitch/PacketSwitchStatus.jsp
If using 8.5.2 instead of 8.5.1, use this page for PS status: http://machineA:9084/stav/PacketSwitchStatus.jsp
Once all of these are working and showing the correct status, then try to test the client.
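If you get tired of reloading the status pages by hand, a small script can at least tell you whether each page answers. This is a sketch: the host name and ports are the example values used above, the 8.5.1 PS path is used (swap in /stav/PacketSwitchStatus.jsp for 8.5.2), and the function names are made up. It only checks that a page responds with HTTP 200; you still have to read the page to see who has registered.

```python
from urllib.request import urlopen

# (label, HTTP port, path) for the status pages mentioned above.
STATUS_PAGES = [
    ("PR", 9081, "/Registrar/RegistrationsTable.jsp"),
    ("CM", 9083, "/ConferenceFocus/ConferenceFocusStatus"),
    ("PS", 9084, "/STAVPacketSwitch/PacketSwitchStatus.jsp"),
]


def status_urls(host, pages=STATUS_PAGES):
    """Build (label, url) pairs for one machine."""
    return [(label, f"http://{host}:{port}{path}") for label, port, path in pages]


def page_is_up(url, timeout=5.0):
    """Return True if the page answers with HTTP 200, False on any network or HTTP error."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False


def report(host):
    """Print one line per status page for the given machine."""
    for label, url in status_urls(host):
        print(f"{label}: {'up' if page_is_up(url) else 'NOT reachable'} ({url})")
```

Call report("machineA") after starting each server, in the same PR, CM, PS order as above.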
Single Nodes (#2 nodes)
Once everything is working with the #1 nodes, I like to confirm that everything works with the #2 nodes.
Shut down all of the nodes. Perform the steps above, but with the #2 nodes.
All nodes, HA cluster enabled.
Once the #2 set of nodes has been confirmed - then, and only then - you can start up all of the nodes and have them running at the same time. (See the note above about WASPR2, though - the 2 WAS Proxy PR nodes are not allowed to be running at the same time.)
Debugging
Hostname / ports incorrect:
When there are issues, a common cause is that the information in the stavconfig.xml file used by the servers is incorrect.
Each server writes a *.info file to its log directory, which specifies exactly which parameters it is using. The conference manager, as an example, will write a ConferenceManager.info file.
Verify in these files that the values the servers are using are what you expect.
Certificates not exchanged correctly
The PS requires that the certs for the PR (and thus, the SSC / DM) be imported. Be sure to follow these steps in the information center if the PS won't register with the PR.
Notes on Load Balancers
Never use a LB on first configuration. Set up everything and test everything according to the instructions above.
Then, after it is known to be working, go back and reconfigure for a LB. You will need to edit the stavconfig.xml file, and push that new version of the file out to all servers.
It is much easier to move from a working system, and add the Load Balancer, than to try to debug a system that has never worked and has a load balancer in front of it.
Common Issues
These are the most common issues that we have run into during an installation of 8.5.1 / 8.5.2 clusters. Problems specific to a version are listed.
stavconfig.xml files overwritten
Sometimes weird things will start to happen to your stavconfig.xml files. Make sure your machines all have the same timezone, and that they are all using NTP or have very closely synchronized clocks.
Always sync from the DM to the nodes. The PS copies must be updated manually.
SIPS port == -1 in stavconfig.xml
There is a known problem with even the latest 8.5.1.1 that, on an update of some value in SSC for the configuration values of the Media Server, the port number used for secure SIP traffic to the PR will be overwritten with the value of -1.
A hotfix is available and should be applied to SSC. The (temporary) workaround is to not touch any values in SSC for the Media Manager. (Even pressing OK on that page will cause the rewrite.)
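A quick way to spot this condition is to scan a copy of stavconfig.xml for port settings that have been set to -1. The sketch below is a heuristic, not the official check: it assumes the affected setting is a configuration element whose name ends in "Port" (as the entries shown earlier do), and both the root element name in SAMPLE and the function name are made up.

```python
import xml.etree.ElementTree as ET


def ports_set_to_minus_one(xml_text):
    """Return names of <configuration> entries ending in 'Port' whose value is -1."""
    root = ET.fromstring(xml_text)
    return [
        el.get("name")
        for el in root.iter("configuration")
        if el.get("name", "").endswith("Port") and el.get("value", "").strip() == "-1"
    ]


# Hypothetical sample; the root element name is an assumption.
SAMPLE = """<stavconfig>
  <configuration lastUpdated="1226425838277" name="ConferenceServerPort" value="5065"/>
  <configuration lastUpdated="1226425838277" name="SIPProxyServerPort" value="-1"/>
</stavconfig>"""

if __name__ == "__main__":
    print(ports_set_to_minus_one(SAMPLE))
```

If the list is not empty after someone has been in SSC, restore the unified file and resync.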
Hostnames with sip at the beginning
There is a known problem with 8.5.1.0 where hostnames that begin with sip will cause issues. For example siphost.ibm.com.
If these host names are in use, apply 8.5.1.1 server and client fixes.
Version numbers
The .info files in the server logs include the version number of the build. Verify on all servers that the version is the one you think it is. 8.5.1.0 works out of the box for simple scenarios, but 8.5.1.1 fixes several edge cases.
IPv6
IPv6 will absolutely cause you problems. It should be disabled. Use netstat to verify that IPv6 has been disabled.
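The netstat check can be scripted roughly as follows. This is a heuristic sketch: netstat output differs between platforms, so the function (a made-up name) simply flags lines whose protocol column is tcp6 / udp6 or whose address columns contain "::". Treat an empty result as a hint that nothing is bound on IPv6, not as proof.

```python
import subprocess


def ipv6_sockets(netstat_output):
    """Return lines of `netstat -an` output that look like IPv6 sockets."""
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        proto = fields[0].lower()
        # Linux labels IPv6 sockets tcp6/udp6; other platforms show '::' in the address.
        has_v6_address = any("::" in field for field in fields[1:])
        if proto in ("tcp6", "udp6") or has_v6_address:
            hits.append(line.strip())
    return hits


def run_netstat():
    """Run `netstat -an` on the local machine and return its text output."""
    return subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=False
    ).stdout


if __name__ == "__main__":
    for line in ipv6_sockets(run_netstat()):
        print(line)
```

Run it on each machine after the servers are started; any WAS process still listed on an IPv6 socket is a candidate for the property described below.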
If it is not possible to disable IPv6 on the machine, then special action can be taken to disable IPv6 in WAS.
You need to add the property java.net.preferIPv4Stack with the value of true to all of the WAS processes. The following locations include all WAS processes:
Application server | Application servers > click server > In the Server Infrastructure section, click Java and process management > Process definition > Java virtual machine > Custom Properties |
WAS SIP proxy | Proxy servers > click WAS SIP proxy server > In the Server Infrastructure section, click Java and process management > Process definition > Java virtual machine > Custom Properties |
Deployment manager | System Administration > Deployment manager > Java and process management > Process definition > Java virtual machine > Custom Properties |
Node agent | System Administration > Node agent > nodeagent > Java and process management > Process definition > Java virtual machine > Custom Properties |
Calls cut off after 2-3 minutes
There is a known problem that occurs with 8.5.1.X if the PR entry point for SIP traffic is 5060, and TLS is not being used. The issue is that the sip: URLs don't include the port number (because 5060 is the default port number for sip) and it causes a string not to match.
The CM will never see the responses to an UPDATE request to the client, and will kill all calls after 2 1/2 minutes, thinking the client has gone offline.
The solution to this problem is to move the PR SIP entry point to any port other than 5060, or to disable the check that pings the client.
To disable the check that the client is still online (disabling the UPDATE request), change the SessionExpiry to 0.
Change this: <configuration lastUpdated="1226425838277" name="SessionExpiry" value="150"/>
to <configuration lastUpdated="1226425838277" name="SessionExpiry" value="0"/>
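If you prefer to script the change rather than edit the file by hand, the following sketch sets SessionExpiry to 0 in a copy of the file. Assumptions: the root element name in SAMPLE is made up, the function name is hypothetical, and the lastUpdated attribute is left untouched because the text above changes only the value. ElementTree drops comments and the XML declaration when it rewrites a document, so keep a backup of the original.

```python
import xml.etree.ElementTree as ET


def disable_session_expiry(xml_text):
    """Set SessionExpiry to 0; return (new_xml_text, number_of_entries_changed)."""
    root = ET.fromstring(xml_text)
    changed = 0
    for el in root.iter("configuration"):
        if el.get("name") == "SessionExpiry" and el.get("value") != "0":
            el.set("value", "0")
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical sample; the root element name is an assumption.
SAMPLE = """<stavconfig>
  <configuration lastUpdated="1226425838277" name="SessionExpiry" value="150"/>
</stavconfig>"""

if __name__ == "__main__":
    new_text, count = disable_session_expiry(SAMPLE)
    print(f"changed {count} entry")
    print(new_text)
```

As with any other edit to stavconfig.xml, push the changed file to all servers so the copies stay identical.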
Calls missing video / video only in one direction
Firewalls. Firewalls. Make sure that traffic is being allowed to flow on the ports used for video in both directions. Many firewalls do strange checking of UDP traffic - for example, only allowing UDP traffic if it's flowing in both directions.
Wireshark is the easiest way to confirm that packets are really flowing correctly between client -> PS -> client.
Enabling traces on a WebSphere SIP proxy server
To debug issues at the WebSphere SIP Proxy, we normally enable these traces:
*=info: com.ibm.ws.sip.*=all: com.ibm.ws.proxy.*=all
However, when these traces are enabled, the WebSphere SIP proxy performs reverse DNS lookups, which cause a delay of a few minutes. As a result, SIP requests fail with a timeout (408 error).
If you need to debug issues at the WebSphere SIP proxy server, use the following trace filter instead:
*=info: com.ibm.ws.sip.*=all: com.ibm.ws.proxy.*=all: com.ibm.ws.proxy.channel.sip.SipProxyConnection=off